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Abstract 

Background: A phylogenetic tree, showing ancestral relations among organisms, is commonly represented as a 
rooted tree with sets of bifurcating branches (dichotomies) for simplicity, although polytomies (multifurcating 
branches) may reflect more accurate evolutionary relationships. To represent the true evolutionary relationships, it 
is important to systematically identify the polytomies from a bifurcating tree and generate a taxonomy-compatible 
multifurcating tree. For this purpose we propose a novel approach, "PolyPhy", which would classify a set of 
bifurcating branches of a phylogenetic tree into a set of branches with dichotomies and polytomies by considering 
genome distances among genomes and tree topological properties. 

Results: PolyPhy employs a machine learning technique, BLR (Bayesian logistic regression) classifier, to identify 
possible bifurcating subtrees as polytomies from the trees resulted from ComPhy. Other than considering genome- 
scale distances between all pairs of species, PolyPhy also takes into account different properties of tree topology 
between dichotomy and polytomy, such as long-branch retraction and short-branch contraction, and quantifies 
these properties into comparable rates among different sub-branches. We extract three tree topological features, 
'LR' (Leaf rate), IntraR' (Intra-subset branch rate) and InterR' (Inter-subset branch rate), all of which are calculated 
from bifurcating tree branch sets for classification. We have achieved F-measure (balanced measure between 
precision and recall) of 81% with about 0.9 area under the curve (AUC) of ROC. 

Conclusions: PolyPhy is a fast and robust method to identify polytomies from phylogenetic trees based on 
genome-wide inference of evolutionary relationships among genomes. The software package and test data can be 
downloaded from http://digbio.missouri.edu/ComPhy/phyloTreeBiNonBi-1 .0.zip. 



Background 

Evolutionary histories (or phylogenies) are an integral 
part of many studies in modern biology. Phylogenies 
have long been used to study the relationships among 
species. Most phylogenetic trees are in the form of a 
rooted tree where the leaves represent the species, and 
internal nodes represent their hypothetical ancestors. 
Consequently, the phylogenetic tree becomes a branch- 
ing diagram showing the inferred evolutionary relation- 
ships among various biological species based on 
similarities and differences in their physical and/or 
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genetic characteristics. The two most commonly seen 
topological sub-structures of trees are dichotomy and 
polytomy. Although resolving phylogenetic trees into 
dichotomous branching patterns has been a general goal 
in phylogenetics [1], the enforced resulting trees might 
be stretched beyond the necessary evolutionary 
assumptions. 

Polytomies are multifurcating (as opposed to bifurcat- 
ing) relationships in phylogenetic hypotheses and occur 
for two reasons: First, polytomies can result from poor 
resolution of true bifurcating relationships (due to lack 
of sufficient data or inappropriate analysis of characters), 
and these are "soft" polytomies; second, polytomies 
can represent bona fide multifurcations (multiple, simul- 
taneous divergence events), and these are "hard" 
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polytomies [2]. For example, simultaneous divergences 
of three or more lineages can occur due to the isolation 
of subpopulations within a widespread species by sud- 
den meteorological or geological events resulting in 
reproductive isolation due to the rapid expansion of the 
population into open territory. Traditionally, researchers 
assumed that hard polytomies were exceptions and rare 
in nature and treated them as expression of ignorance 
[3,4]. With this assumption, researchers traditionally 
sought to resolve polytomies in binary trees, often by 
increasing the number of characters analyzed. However, 
more recent evidence shows the existence of simulta- 
neous speciation events such as with the Drosophila 
simulans species complex [5,6] shown in Figure 1 and 
African cichlid fishes which speciate quickly after their 
home lakes formed in Africa resulting in several phylo- 
genetic polytomies [7]. 

Microbial taxonomy imposes many difficulties in 
determining whether generated phylogenetic tree 
branches are bifurcations or multifurcations. The major 
problem is that microbes, although morphologically 
diverse, may have fewer morphological characters than 
macrobes. Consequently, the rate of change in morpho- 
logical characters may be greater in macrobes than in 
microbes [8]. Thus, microbes might not have enough 
diverse morphologies to do deep taxonomy accurately 
beyond the level of phylum. As a result, grouping of 
microbes into many taxonomy trees shows up as multi- 
furcating clads instead of resolved bifurcating trees. 
Also, bacteria evolve in a different manner from eukar- 
yotes. Eukaryotes predominately evolve by point muta- 
tions (changes in single base-pair in DNA) whereas 
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Figure 1 Phylogenetic tree of Drosophila melanogaster species 
group. Redrawn from [5]. This phylogeny shows the relationships 
among some species in the Drosophila group in which there is a 
hard polytomy branch. The reason could be that the island species 
D. mauritiana and D. sechellia branched off from the mainland 
species D. simulans in a narrow timeframe, such that it is 
impossible to distinguish which species branched off first and 
which second. 



bacteria often evolve by inserting or deleting large 
chunks of DNA [9-11]. Thus, microbial phylogeny ana- 
lysis of available molecular sequences, rRNA and pro- 
tein, have difficulties to convincingly resolve in the 
many specific branching orders of the microbial divi- 
sions due to incongruence of sequence diversity. This 
suggests that the base of the microbial tree is best seen 
as a polytomy, an expansive radiation that has not been 
resolved with current data [12]. As a result, many 
sources of species trees, such as the NCBI Taxonomy 
Database, use polytomies for species trees. In fact, 64% 
of the branch points in the NCBI taxonomy Database 
[13] have three or more children. In addition, a multi- 
furcated species tree is indeed valuable to infer gene 
duplication events and orthologs [14-17]. As with any 
approach that imposes structure on the data [18-20], 
bifurcations are the imposition of method, not necessa- 
rily the reality. Due to the nature of algorithmic pair- 
wise comparison of many phylogeny methods, often 
phylogeny visualization tools display a tree as a bifurcat- 
ing cladogram, which is a directed bifurcating tree with 
a unique node corresponding to the (usually imputed) 
most recent common ancestor of all the entities at the 
leaves of the tree. Thus, all tree-building methods will 
force a binary tree on the data without considering the 
possibility that resulting trees might stretch beyond the 
assumption. A justification may be that it is easier to 
work on a strictly binary set of nodes although there are 
possibilities of the polytomies existing in trees [21]. 
However, even if there is a real dichotomous structure 
in the data, unresolved nodes will often occur mostly at 
or near the terminal branches due to lack of informa- 
tion. In this case, it is better to use polytomies instead 
of forcing dichotomies artificially. As suggested, poly- 
tomies might be quite common in microbial taxonomy 
trees since many times evolutionary relationships of 
interested species cannot be fully resolved to separate 
descending branches or difference of timeframes 
between two divergences. 

The most popular approach to obtaining polytomies is 
through forcing a threshold on bootstrap values of 
branches generated from a consensus tree, which is pro- 
duced from a set of gene trees based on different selec- 
tions of gene subsets. Then multiple bifurcating 
branches with low bootstrap values would be combined 
into multifurcating branches. This method has been 
employed in a number of popular phylogenetic tools, 
such as PAUP [22], Phylip [23] and Splitstree [22-24]. 
The result is often ad hoc and coarse grain without any 
significant statistical basis. Another approach for polyt- 
omy identification from bifurcating branches is through 
literature mining and manual curation of biology experts 
in order to match up with the general understanding of 
the taxonomy tree of life [5-7] which is laborious and 
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time consuming. Until now, there is still no systematic 
algorithm and method to automate the process of polyt- 
omy identification from a bifurcating tree. Therefore, it 
is essential to construct a systematic algorithm to iden- 
tify polytomies from an existing bifurcating tree to facili- 
tate the phylogenetic analysis. 

With fast accumulations of molecular sequencing data, 
especially using whole-genome sequences from different 
organisms, inferring molecular evolutionary relationships 
using genomic data has been a popular phylogeny analysis 
approach in recent years. Often subsets of gene or protein 
families would be selected to infer the so-called "gene 
trees" for interested organisms to establish phylogenetic 
relationships. These analyses typically result in bifurcating 
phylogeny trees containing only dichotomies. Though 
gene trees are good sources for hypothesizing species 
trees, they introduce other problems, such as the incon- 
gruence of tree branches in the phylogenetic trees due to 
different selections of gene subsets. Hence, it is logical to 
introduce the polytomies instead of trying to resolve every 
branch when there is incongruence among tree branches 
of different trees from the same set of species. 

In this study, we propose an innovative strategy, called 
'PolyPhy' (phylogeny construction with polytomy identi- 
fication) to identify polytomies from a phylogenetic tree. 
In PolyPhy, the process starts with phylogenetic tree 
construction for many species and then combines a fea- 
ture extraction process with a Bayesian classification 
method to identify polytomies from a bifurcating phylo- 
genetic tree. The PolyPhy method takes into account 
different tree topological properties between dichotomy 
and polytomy, such as long-branch retraction and short- 
branch contraction, and quantifies these properties as 
comparable rates among different sub-branches. More 



specifically, PolyPhy utilizes ComPhy [25], a microbial 
tree reconstruction tool that we developed based on 
genome-structure features. After a phylogenetic tree is 
built, two sets of bifurcating branches, presumably 
dichotomies and polytomies, are extracted based on Ber- 
geys' Taxonomy. Then three tree structure features, 'LR' 
(Leaf rate), 'IntraR' (Intra-subset branch rate) and 
'InterR' (Inter-subset branch rate), are calculated from 
bifurcating tree branch sets based on distance matrices 
derived from ComPhy. The next step is BLR (Bayesian 
Logistic regression) classification, which is a recursive 
training process to select best classification parameters 
to fit the input features and generate a model for classi- 
fication. The last step is identifying potential polytomies 
and checking for accuracies. 

Results & discussion 

Optimizing Jackknife ability 

In ComPhy, genome structures such as breakpoints of 
gene ordering are constructed to infer genome distance 
based on orthologs defined in a pair of genomes. There- 
fore, we treated individual orthologs as samples for jack- 
knife procedure. We resampled the dataset by selecting 
30% to 90% of the orthologs incrementally. Figure 2 
shows the distribution of average accuracy for trees gen- 
erated by different percentage selections of orthologs 
after 50 iterations. It is noticeable that the 60% cutoff 
worked well since the increase of accuracy beyond this 
point almost reached a plateau even when more data 
was included. Therefore, a 60% cutoff was used as the 
default for genome feature jackknife phylogeny analysis. 
With 60% jackknife resampling, we could generate 
unbiased tree replicates for the next step of classification 
training and testing. 
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Figure 2 Accuracy vs. different ortholog percentage selection curves. The horizontal-axis is percentage (ranging from 30% to 90%) of 
orthologs being selected to produce the tree and y-axis is the accuracy. This figure shows two different accuracy curves of different ortholog 
percentage selections. The blue curve shows the accuracy of the tree generated by ComPhy in comparison to the taxonomy. And the red curve 
is the average bootstrap values from the tree branches for each generated tree. 
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Classification signal strength in three features 

To efficiently utilize each of the features for the classifier 
described in "Material & Methods," it is important to 
know how strongly each feature can individually distin- 
guish between dichotomy and polytomy. Figure 3 repre- 
sents the distribution of each feature value in each class 
(dichotomy or polytomy) in the two datasets introduced in 
"Material & Methods." Three features of Part A in Figure 
3 are generated from the smaller dataset consisting of 83 
genomes; and genome distances were calculated based on 
traditional MSA (multiple sequence alignment) using 16s 
rRNA sequences. Part B of Figure 3 is generated from the 
larger dataset, which has 398 microbial species, and gen- 
omes distances were calculated based on composite dis- 
tance from ComPhy. Figures 3A-1 and 3B-1 correspond to 
the leaf feature rate distribution, while 3A-2 and 3B-2 cor- 
respond to intra-subset branch rate with 3A-3 and 3B-3 



from inter-subset branch rate. These figures clearly 
demonstrate the distinguishable divisions of feature rate 
distributions between dichotomy and polytomy. Further- 
more, similar distribution patterns were observed for all 
three features between set A figures and set B figures even 
though these distributions were generated based on differ- 
ent genome distance calculations. This indicates that not 
only extracted features have strong discerning power, but 
also the performance is not significantly subject to differ- 
ent genome distance methods. 

BLR parameter optimization 

As in other multivariate statistical models, the perfor- 
mance of BLR in classification depends on the combina- 
tion of several parameters. Here, BLR involves mostly 
two classes of parameters: 1) the prior variance para- 
meter V and 2) the error tuning parameter t V is a 
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Figure 3 Classification strength of each of three tree topological features. Part A of the figures is generated using 83 microbial species 
based on MSA (multiple seguence alignment) on 16S rRNA seguences. A-1 shows leaf rate topological feature generated per dichotomies and 
polytomies, while A-2 and A-3 use IntraR and InterR features, respectively. Part B of the figures is generated using all 398 microbial species based 
on composite distance calculation from ConPhy. B-1 shows leaf rate topological feature generated per dichotomies and polytomies, while B-2 
and B-3 use IntraR and InterR features, respectively. 
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regulation parameter that controls the balance and tra- 
deoff among individual feature distribution variance and 
their prior skews; V also tries to minimize the overall 
distribution variance. How to choose a good set of prior 
variance is not well understood and varies from dataset 
to dataset. Therefore, we chose variance (V) values 
based on commonly used variance range and applied 
the best fit. For V: V i+1 = TV t , where V x = 0.002 and i 
- 1,..., 20, i is the number of iterations. The error tuning 
t is another parameter to help confirm the right selec- 
tion of other parameters by trying different accuracy 
measurements. Five criteria can be used for error tuning 
of t with 0 representing a no tuning threshold: 

1) sum of error = (FP + FN); 

2) balanced error rate BER = (sensitivity + specificity)/ 2; 

3) T11U = 2*TP - FP [26]; 

4) Fl measure [27]; 

5) T13U = 20*TP - FP [28], where TP is true positive 
value, FP is false positive value and FN is false negative 
value. 

Accordingly, two parameters, V and t, should be opti- 
mized. The parameter optimization was performed by 
using a grid search within a limited range. To minimize 
over-fitting of the prediction model, three-fold crossover 
validation was used to investigate the training set. To 
evaluate predicted accuracy, an F-measure was used [27] 
as the weighted harmonic mean of precision and recall; 
and it is defined as follows: 

precision * recall 2 * TP 

F1(F — measure) = 2*- = . 

precision + recall 2 * TP + FP + FN 

During the BLR classification model training, each 
data point represents a 2-level bifurcation with three 
leaf nodes (a triplet). If the bifurcation triplet is in the 
gold standard dichotomy set from Bergey tree, one must 
assign class label 1; otherwise -1. Figure 4 shows the 
profile of predicting the accuracy of the three-fold 
cross-validation on the training set versus the variations 
of the parameters V and t Obviously, the accuracy pro- 
file has a maximum values peak at (V, t) = (0.128, 3), 
indicating that the optimal values of V and t for con- 
structing BLR models are 0.128 and 3, respectively. 

Another parameter we provide in the PolyPhy algo- 
rithm is to let users choose types of priors for BLR clas- 
sification. Two types of prior are commonly employed, 
Laplace (sparse data and less features) and Gaussian 
(approximate normal distribution). Users can try both 
prior types to achieve the best prediction result based 
on their data property. 

Prediction ability 

Using the optimal values of V and t, the polytomy clas- 
sification model was constructed based on the training 



set by using the BLR learning algorithm provided by the 
BBR package [29]. To minimize data dependence on the 
prediction model, 10 tree replicates were prepared 
through jackknife procedure, five from each dataset (see 
"Material & Methods")- Each training set from data set 

1 consisted of around 200 Bergey system confirmed 
bifurcating triplets, while each training set from data set 

2 consisted of around 60 confirmed bifurcating triplets. 
All training sets were half confirmed dichotomies and 
half confirmed polytomies. The classification results are 
listed in Table 1. For all test models, the balanced 
accuracies were > 80.24%, and the AUC value was 
83.31%. Training sets from data set 2 had higher accu- 
racy than sets from data set 1 because of their less var- 
iance from the smaller genome set. 

To further improve the classification accuracy, three 
topological features were added per different distance to 
the BLR classifier. Distance measures GDD, GBD and 
GCD from ComPhy were used with all three tree fea- 
tures included per distance respectively (nine features in 
total). The classification accuracy reached as high as 
0.87 for balanced accuracy, 0.95 precision, 0.85 recall, 
and 0.95 AUC value for all the training sets from data 
set 1, but the classification accuracy showed only around 
5% improvement for training sets from data set 2. 

Multifucation tree generation 

With polytomies being identified through the BLR clas- 
sification, PolyPhy can make a multifurcating tree. The 
PolyPhy team took the most intuitive approach by using 
the average of the branch lengths from a bifurcation 
identified as a polytomy. For example, from Figure 5, 
Part I, where S 1} S 2 , S 3 and S 4 are branch lengths from a 
bifurcation, S 1} S 2 and S 3 were averaged and the branch 
S 4 was removed in order to make the subtree like Part 
II in Figure 5. Note, in the classification process, Poly- 
Phy generated either independent classifier for each 
jackknife-generated tree or a combined classifier for one 
overall tree to accommodate those who just want to see 
one final tree produced in the end when doing analysis. 
Therefore, in generating the final tree, a consensus mul- 
tifurcating tree was produced from several independent 
multifurcating trees and multiple classifiers were 
applied. We applied the accuracy evaluation from Com- 
Phy [25] and evaluated the accuracies of the multifurcat- 
ing trees against Bergey's taxonomy system. A 
comparable accuracy was achieved with 30% more poly- 
tomies being validated due to the fact that ComPhy pre- 
viously only evaluated dichotomy branches. 

Conclusion 

Polyphy, a standalone tool for phylogeny reconstruction 
and polytomy identification, is robust and easy to use 
for studying evolution. It can construct a consensus tree 
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without requiring multiple sequence alignment and can 
also identify polytomies without requiring manual inter- 
ferences. It is a fully automated tool. The phylogeny 
reconstruction implements a jackknife recursive sam- 
pling procedure; then recursively trains the BLR classifi- 
cation model for polytomy identification. It allows users 



Table 1 Classification results of the test sets 





Test 


Accuracy, 


Precision, 


Recall, 


ROC AUC 




set 


% 


% 


% 


value 




data 1.1 


81.34 


75.55 


88.1 


87.89 




data 1.2 


83.38 


79.15 


88.09 


89.39 




data 1.3 


80.24 


88.39 


73.47 


85.71 




data 1.4 


80.29 


89.89 


72.55 


88.09 




data 1.5 


80.49 


70.67 


93.48 


83.31 


combined 


datal 


81.15 


80.73 


83.138 


86.88 




dataZI 


88.24 


88.24 


88.24 


92.52 




data2.2 


86.09 


85.36 


86.84 


91.01 




data2.3 


89.21 


90.01 


88.42 


94.52 




data2.4 


87.98 


87.24 


88.74 


92.54 




data2.5 


83.23 


84.24 


82.24 


85.12 


combined 


data2 


86.95 


87.02 


86.90 


90.74 



Precision is the true positive/(true positive + false positive). Recall is true 
positive/(true positive + false negative). AUC quantifies the overall ability of 
the test to discriminate between positive samples and negative samples. 



to infer a multifurcating phylogenetic tree for any set of 
microbial genomes of interest to study their evolution- 
ary relationships. Although the tool is built for recon- 
structing a multifurcating phylogeny from whole 
genome sequences, users can have a pre-generated 
bifurcating tree from any phylogenetic tool as the start- 
ing point for classification. We believe that this is a 
timely and necessary development as more and more 
microbes are sequenced daily (especially from the meta- 
genomic analysis of microbial communities) without 
reliable taxonomy established. 

Material & methods 

Taxon selections 

We used a set of 398 single-chromosome microbial gen- 
omes, as shown in Table 2, consisting of 31 Archaea 
and 367 bacterial genome sequences. These sequences 
were previously used in the development of ComPhy 
[25]. The distribution of phylum clad in this dataset 
represents the trend of microbial taxonomy classification 
[30], in which most genome sequences are from three 
phyla, Proteobacteria, Firmicutes and Actinobacteria. 

The second dataset consists of 83 microbial genomes 
[31], whose 16S rRNA sequences were obtained from 
the Greengenes microbial database [32]. These 83 
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/) If it is identified as polytomy 



Figure 5 Polytomy identification. Part I presents a three-level bifurcation subtree with 4 genome taxa A, B, C and D. X h X 2 and X 3 are genome 
distances calculated among A, B and C. And S 7 , S 2l S 3l S 4 and S 5 are branch lengths. If bifurcating subtree of A B and C is identified as a 
polytomy, then the sub-tree can be converted to a multifurcation subtree, shown as Part II. 



genomes are also part of the first data set of 398 micro- 
bial genomes, and they were used to generate phyloge- 
netic tree solely based on the 16S rRNA sequences. 

PolyPhy (phylogeny with polytomy identification) tool 

This research developed a machine-learning tool, Poly- 
Phy, for polytomy identification. PolyPhy starts with 
ComPhy reconstructing a large-scale microbial phylo- 
geny using whole-genome structural features. It obtains 
multiple unbiased trees through a jackknife procedure. 
PolyPhy uses three tree topological features to be dis- 
cussed in following sections, and through recursive 
trainings using a machine learning method to identify 
polytomy from the tree. The last step of the tool is 



Table 2 Taxon statistics of the 432 prokaryotic complete 
genomes 



Phylum 


C 


0 


F 


G 


S 


str 


A1 


1 


3 


4 


4 


7 


7 


A2 


8 


9 


12 


18 


23 


23 


A3 




1 


1 


1 


1 


1 


Subtotal 3 


10 


13 


17 


23 


31 


31 


B1 




1 


1 


1 


1 


1 


B2 




1 


1 


1 


1 


1 


B4 




2 


2 


2 


3 


4 


B6 




1 


1 


1 


2 


2 


B10 




3 


3 


8 


15 


19 


B11 




1 


1 


2 


4 


4 


B12 


5 


33 


53 


99 


157 


208 


B13 


3 


7 


14 


22 


58 


96 


B14 


3 


9 


15 


16 


31 


35 


B15 


1 


1 


1 


1 


1 


1 


B16 


1 


1 


2 


3 


7 


11 


B17 


1 


1 


2 


3 


7 


9 


B19 


1 


1 


1 


2 


2 


2 


B20 


3 


3 


5 


5 


6 


7 


B21 


1 


1 


1 


1 


1 


1 


Subtotal 15 


25 


66 


103 


167 


296 


401 


Total 18 


35 


79 


120 


190 


327 


432 


P = Phylum, C 


= Class, 0 


= Order, F 


= Family, G 


= Genus, S = Species, str = 



strain. A = Archaea and B = Bacteria. 



generating a meaningful multifurcating tree. No fixed 
threshold is imposed on any of the patterns to decide if 
a two-level bifurcation is a dichotomy or polytomy since 
every phylogenetic tree will have a different distance 
value and branch. Thus, through a machine-learning 
method, i.e., the BLR classifier, the extracted features 
can be clearly used to train the classification model 
without needing to know the exact cutoff values. 

Phylogenetic tree reconstruction 

Obtaining an optimal binary phylogenetic tree as input 
tree for PolyPhy is as important as identifying polytomy 
from the binary tree. It has been traditionally considered 
that likelihood-based methods have more accurate phylo- 
genetic relationship inferences than distance-based meth- 
ods such as Neighbor-Joining (NJ) [33,34]. However, in 
2010, Roch [35] showed that when sets of large-scale spe- 
cies are involved in phylogenetic analysis, the distance- 
based method proved much more effective, practical and 
surprisingly comparable in performance [36]. Thus, given 
the amount of genomes used in this study, distance- 
based phylogenetic inference was applied. 

Phylogenetic tree reconstruction using ComPhy 

ComPhy [25] is a stand-alone Java-based tool which uti- 
lizes a robust and efficient strategy called 'Gene Compo- 
site Distance' to combine different aspects of 
evolutionary relationships among genomes for produ- 
cing a phylogenetic tree from a given set of whole gen- 
ome sequences. Specifically, composite distance measure 
starts with an all-against-all pairwise genome compari- 
son using BLASTP [37]. In the second step, a distance 
matrix is calculated from three components, i.e., GDD 
(Gene Dispersion distance), GBD (Genome Breakpoint 
distance) and GCD (Gene Content distance). This dis- 
tance matrix is then fed into a distance-based algorithm, 
Neighbor- Joining (NJ) [33,34], using a third-party tool, 
Phylip [23], to produce phylogenetic trees. 

Phylogenetic tree reconstruction using 16s rRNA sequences 

Besides using the whole genome structural features to 
infer microbial phylogeny by ComPhy, we also applied 
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multiple-sequence alignment using 16s rRNA sequences 
from the second dataset of microbial genome (see more 
detail in Taxa selections section) for evolution distance 
inference. MUSCLE [38] was applied for multiple 
sequence alignments for datasets with the 83 microbes. 
Then a pairwise genome distance matrix was generated 
from the alignments using "distMat" from EMBOSS 
with Kimura correction [39]. Finally, this distance matrix 
was fed into Neighbor-Joining (NJ) [33,34], using Phylip 
[23] to produce a phylogenetic tree. 

Jackknife procedure in phylogeny 

Jackknifing is a statistical method to estimate the accu- 
racy of sample statistics by repeatedly using subsets of 
available data to produce the results. In order to obtain 
multiple meaningful phylogenetic microbial trees, Poly- 
Phy provides the user with capability to generate trees 
from the same dataset through a jackknife procedure to 
obtain robust and non-biased training datasets of bifur- 
cating subtrees from the trees. We subsequentially 
selected different subsets of whole genome gene sets in 
which ortholog ordering was the main component used 
in ComPhy distance measure to generate different trees. 
It is not a trivial task to apply the perturbation to ortho- 
log selection and assess the robustness of the data. 
Though some have tried to use jackknife in large-scale 
phylogeny analysis, the studies were not conclusive 
[40,41]. Here, we conducted some of the performance 
tests on jackknife procedure using 398 microbial 
genomes. 

The major steps are performed as follows: 1) Generat- 
ing a new set of orthologs by randomly selecting k% 
from overall orthologs of each genome. Orders of the 
selected orthologs are preserved with respect to their 
order in the original genomes. 2) Reconstructing a tree 
replicate from these genomes with new ortholog subsets. 
3) Repeat step 1 with same k value but different ran- 
domly chosen ortholog subsets 50 times to obtain 50 
different tree replicates. 4) Compute a consensus tree 
and corresponding confidence values on all internal 
branches by using the CONSENSE program in PHYLIP 
[23] from 50 tree replicates. 5) Repeat Step 1 with dif- 
ferent k values, ranging from 30 to 80. Through this 
procedure, an optimal selection of subset of orthologs 
can be determined for jackknife selection. 

Polytomy identification through classification 

We have developed a machine-learning process to carry 
out the task of identifying polytomies from given bifur- 
cation tree. Figure 5.1 is an example of the bifurcating 
tree with three levels of bifurcating branching, (((A, B), 
C), D). If evolutionary relationships among leaves gen- 
omes A, B and C cannot be fully resolved due to conflict 
supports with low bootstrap values or simply more likely 



simultaneous speciation, then we would classify this 
bifurcating branch as a polytomy subtree, as shown in 
Figure 5. II. We will utilize the BLR (Bayesian Logistic 
Regression) classification, a machine learning approach, 
to automatically learn and classify bifurcating branches 
into dichotomies or polytomies. After a classification 
model is trained from extracted features, we can classify 
a bifurcating branch into a dichotomy subtree or a 
polytomy subtree. The following shows the details of the 
procedure: 

(1) Bayesian logistic regression 

The Bayesian approach is attractive for its ability to 
incorporate prior knowledge into statistical inference; 
and the logistic regression can provide a simple but 
accurate model for the predictions with calibrated prob- 
abilities without extrapolating input datasets to a com- 
plex and higher dimension. Bayesian logistic regression 
utilizes the best features of both methods by employing 
the prior information about the success probability and 
recursively optimizing the prediction parameters to 
achieve the optimal classification/prediction rather than 
simple regression coefficient for classifiers [42-44]. BBR 
(Bayesian Logistic Regression Software) package [29] 
was used to train the model and classify the bifurcations 
for dichotomies and polytomies. 

(2) Classification training datasets 

The format of training data for classification is com- 
prised of various two-level bifurcating subtrees as shown 
in Figure 5.1. The gold standard for deciding whether or 
not a binary tree branching is a dichotomy or a polyt- 
omy is the tree topology provided by Bergey's microbial 
taxonomy classification system. For 398 microbial taxa, 
we constructed a bifurcating phylogenetic tree and were 
able to extract 370 dichotomies and 285 polytomies as 
training data. The dataset was split into 90% for model 
training and 10% for testing the accuracy of the predic- 
tion model. 

(3) Classification features 

In order to successfully train a classification model for 
bifurcating branches from a phylogenetic tree, there 
must be a meaningful set of features extracted from the 
bifurcating branches in a tree. Figure 5.1 gives a 
hypothetical bifurcation (a two-layer binary subtree in 
phylogram format) with four different genomes, A, B, C 
and D as leaf nodes, in which genome D is used as out- 
group taxa. Genome distances among A, B and C are 
represented as X lt X 2 and X 3 , which can be calculated 
from ComPhy or any other phylogeny construction tool 
preferred by the users. The branch lengths, S 1} S 2 , S 3 , S 4 
and S 5 are usually the branch lengths generated by dif- 
ferent tree generation algorithms. With this information 
given by tree topology, three topological features can be 
generated which are consistent and can be extracted 
from any bifurcating branch of the tree. They are LR 
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(Leaf rate) as Figure 6.1, IntraR (Intra-subset branch 
rate) as Figure 6.II and InterR (inter-subset branch rate) 
as Figure 6.IIL 

The common misconception of the branch length in 
phylogenetic tree is that it is an irrelevant factor for 
inferring species branching history compared to other 
types of topology information such as hierarchy of the 
branches. Although this view could simplify the presen- 
tation of the tree and still give the users a somewhat 
coarse and high-level examination to infer the evolution 
history using ordering of the taxa, the sensitivity of rela- 
tive evolution timing is lost due to this assumption. For 
the next two topological features, branch length must be 
considered as a strong supporting factor for evolution 
history inferences. The correct branch length is not a 
random assignment, but rather a statistical derivation 
from different phylogenetic tools proportional to the 
predicted or hypothetical evolutionary time between 
organisms or sequences based on the presumed evolu- 
tion model [45-47]. Therefore, the two patterns, IntraR 
and InterR extracted from the tree, use the idea of long- 
branch retraction and short-branch contraction for 
polytomy identification. 
(4) Feature LR generation 

This pattern studies the relative genome distances 
among three studied genomes. In Figure 6.1, if the 



bifurcations is a dichotomy, then X 3 , distance between 

genome A and genome B, will be the shortest genome 

distance among all three (X lf X 2 and X 3 ). Therefore, the 

summation of distance of A to C and B to C will always 

be greater than twice the distance of A to B. Also by 

normalizing with the average distance, this pattern can 

indicate the difference between a dichotomy and a 

Xi+X 2 -2*X 3 
polytomy: Leaf rate = — — — , where X u X 2 

(Ai + A 2 + A 3 J/3 
and X 3 are pair-wise genome distances among A, B and 
C (see Figure 6.1). 
(5) Feature IntraR generation 

This feature considers all the branches within the two- 
level bifurcation (Figure 6. II). If one assumes that the 
two-levels of branching from the top parent node in a 
given bifurcation correspond to two different steps of 
speciation, which is dichotomy, then one should see a 
clear separation between speciation of C and set of (A, 
B), and between A and B. Therefore, the branch of S 4 
attempts to retract and stretch as far as possible to have 
the same branch length of S 3 . On the contrary, if this 
subtree of bifurcation is indeed a polytomy, then length 
of the branch S 4 will be as short as possible so that the 
length Sj will be similar to that of S 2 in order to have 
simultaneous speciation. Hence, the following formula 
captures the contraction and retraction properties and 
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Figure 6 Polytomy identification using three different features. This graph presents three different topological tree structure features for the 
BLR classification. Part I, II and III show examples on how to calculate different values for each of three features for dichotomy bifurcation and 
polytomy bifurcation; and also how different branch contractions and retractions are presented in subtrees. 
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uses them to identify dichotomy and polytomy: 

Intra — subset branch length = ™ n $1, ^2)/max(Si / S 2 ) 

mm (S 3 , S 4 ) /max(S 3 , S 4 ) 
where Sj, S 2 , S 3 and S 4 are the branch lengths. Sj and S 2 
could have the same length, which has no impact on the 
formula. 

(6) InterR generation 

This last classifier feature covers the branch relation- 
ship between the two-level bifurcation and its level 
above through branch s5, as shown in Figure 6. III. If 
one assumes species A, B and C speciated at the same 
time, then the branch contraction property will force 
all branches including S 1} S 2 and S 3 to have similar 
lengths of branching time for species A, B and C to be 
clustered together into a polytomy. With another 
branch S 5 from one higher level of hierarchic topology, 
it is easier to see that the contraction of S 4 against S 5 
has an observable longer branch length. Most dichoto- 
mies show that S 5 is shorter than at least one of the 
intra-subset branches, while S 5 is much longer than all 
of the intro-subset branches for a polytomy. Thus: 

max(Si, S2, $3, S4) 



inter — subset branch length = 



S 5 



where S lt S 2 , S 3 , S 4 and S 5 are the branch lengths. 
(7) Classification procedure 

For BLR classifier, properly estimating the posteriori 
probability is the crucial step in the process. In order to 
have the maximum posteriori estimate of the selected 
class label for input data, an optimal value of prior var- 
iance is needed through a rigorous training. In PolyPhy, 
the recursive prior variance selection can be performed 
within a range of values through /c-fold cross-validation 
to obtain the optimal parameter settings for the training 
model. For each prior variance value, BLR trains k logis- 
tic regression models under a prior with that hyperpara- 
meter. Each model is trained on the union of k-1 of the 
subsets and tested on the remaining subset with each 
subset used as the test subset once. BLR then selects the 
prior variance that maximizes the average value of the 
log-likelihood of a training instance when it appears in 
the test subset. 

There are two ways to use the BLR classifier. One is 
the individual tree training and prediction. From each 
jackknife generated bifurcating tree, PolyPhy extracts all 
bifurcating subtrees. The gold standard matched 
dichotomies and polytomies are used as input to the 
classifier on the rest subsets. The best classification 
model is trained through 3-fold cross validation and 
obtained for each tree. The classifier is then applied to 
the rest of the bifurcating subtrees for each tree to iden- 
tify the polytomies and convert those bifurcating 
branches into polytomies. Subsequently a consensus tree 
can be obtained from all the multifurcating trees. With 



this final tree, the result tree can be compared to bifur- 
cating tree or consensus tree obtained from multiple 
bifurcating trees. Another simpler way of using the BLR 
classifier is to combine all the generated trees to create 
an overall consensus tree and extract one set of bifurcat- 
ing subtree input based on the gold standard Bergey 
bifurcation set. Next a classifier model is generated to 
classify the rest of the bifurcating subtrees for each tree 
to identify the polytomies and convert those bifurcating 
branches into polytomies. 

Availability 

PolyPhy and all the relevant resources are available 

at http://digbio.missouri.edu/ComPhy/phyloTreeBi- 
NonBi-l.O.zip 
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